Use machine learning models to identify and assess risk factors for coronary artery disease

Accurate prediction of coronary artery disease (CAD) is crucial for enabling early clinical diagnosis and tailoring personalized treatment options. This study attempts to construct a machine learning (ML) model for predicting CAD risk and further elucidate the complex nonlinear interactions between the disease and its risk factors. Employing the Z-Alizadeh Sani dataset, which includes records of 303 patients, univariate analysis and the Boruta algorithm were applied for feature selection, and nine different ML techniques were subsequently deployed to produce predictive models. To elucidate the intricate pathogenesis of CAD, this study harnessed the analytical capabilities of Shapley values, alongside the use of generalized additive models for curve fitting, to probe into the nonlinear interactions between the disease and its associated risk factors. Furthermore, we implemented a piecewise linear regression model to precisely pinpoint inflection points within these complex nonlinear dynamics. The findings of this investigation reveal that logistic regression (LR) stands out as the preeminent predictive model, demonstrating remarkable efficacy, it achieved an Area Under the Receiver Operating Characteristic curve (AUROC) of 0.981 (95% CI: 0.952–1), and an Area Under the Precision-Recall Curve (AUPRC) of 0.993. The utilization of the 14 most pivotal features in constructing a dynamic nomogram. Analysis of the Shapley smoothing curves uncovered distinctive “S”-shaped and “C”-shaped relationships linking age and triglycerides to CAD, respectively. In summary, machine learning models could provide valuable insights for the early diagnosis of CAD. The SHAP method may provide a personalized risk assessment of the relationship between CAD and its risk factors.


Introduction
Coronary artery disease (CAD), a prevalent form of heart disease (HD), arises when the coronary arteries (i.e., blood vessels), which supply oxygen and nutrients to the heart muscle, become narrowed or blocked [1].According to the Global Health Estimates 2019 by the World Health Organization (WHO), HD has been the primary cause of death for the past 20 years, remaining responsible for approximately 9 million deaths in 2019, corresponding to 16% of all deaths [2].The pathogenesis of CAD is complex, primarily driven by atherosclerosis within the coronary arteries [3].This condition is characterized by the accumulation of cholesterol and calcium deposits on the vessel walls, leading to the formation of atherosclerotic plaques that impede blood flow to the heart [3,4].Other predisposing factors include age, gender, hypercholesterolemia, smoking, hypertension, and diabetes [5,6].Although CAD is considered to have a multifactorial etiology, there is limited knowledge regarding how different factors contribute to risk.CAD brings a series of serious psychological and economic burdens for affected individuals.It is necessary to develop predictive models to identify risk factors for CAD and to evaluate the relationship between potential risk factors and CAD.This would provide diagnostic aids for physicians in clinical practice and further save patients' lives.
Machine learning (ML) models, grounded in data-driven methodologies, have demonstrated significant success within the medical field, offering novel insights into clinical diagnosis [7].These ML approaches present a distinctive analytical framework for the development of cost-effective interventions and advancements in disease prevention [1,8], Furthermore, they hold the potential to identify key factors and conduct risk assessments [9].Despite the absence of a benchmark for comparing and analyzing machine learning features, methods, and algorithms in CAD diagnosis [10], multiple studies have corroborated the advantages of models based on machine learning approaches, such as [11][12][13][14][15][16].Alizadehsani et al. [17] applied a feature engineering method to increase model performance, compared with other methods, SVM achieved the highest AUC (0.92).Cu ¨vitoğlu [18] used Principal Component Analysis to reduce the dimensions of the feature space and built an ensemble learner, which achieved an AUC of 0.83.Zhang et al. [19] applied five different class balancing techniques to balance the dataset, with the highest AUC (0.93) obtained by LightGBM using all features.Sayadi et al. [20] compared the performance of six ML models using three feature selection methods, and the results showed that both SVM and LR had nearly the same AUC at 0.98.Suryani et al. [21] proposed a feature selection method considering costs, with the DNN (Deep Neural Network) model achieving an AUC of 0.973 with 20 features.Das et al. [22] adopted the minimum Redundancy Maximum Relevance (mRMR) feature selection method, and the random forest (RF) model achieved an AUC of 0.911 based on 21 features.These models still have the potential to further enhance performance.Other literatures, such as [11,17,23,24], focuses on improving accuracy but doesn't report AUROC or AUPRC, which may be misleading and less credible in imbalanced datasets.While many machine learning models (e.g., random forest (RF) and XGBoost, etc.) excel at predictive performance, their decision-making processes are difficult to interpret [25], which may hinder their application in practical clinical settings.A popular and transparent model interpretability method is SHAP (SHapley Additive exPlanations), which is widely used in the existing literature [26][27][28][29].Understanding the potential reasons for a prediction model can guide and help clinicians understand the basis of decisions.Moreover, critical risk variables that affect the development of CAD have not received enough attention.
Therefore, careful investigations of CAD risk factors and comprehensive risk assessment are necessary.This study aimed to identify critical risk factors and establish a predictive model for CAD.Recognizing that different ML algorithms may exhibit varied efficacy across specific problems, we employed a diverse array of ML techniques to construct a risk prediction model.The performance of these models was meticulously compared to determine the ML model offering optimal accuracy and clinical utility.Additionally, we utilized the SHAP approach to elucidate the nonlinear relationships between risk factors and CAD and to assess the inflection points within these relationships.To our knowledge, few studies have explored this topic using Shapley smoothing curves fitting.

Data resources
The Z-Alizadeh Sani dataset, designed for CAD diagnosis, encompasses records from 303 patients collected at the cardiovascular center of Shahid Rajaei Hospital.This dataset, comprising 59 features and is categorized into 216 CAD instances, with the remainder classified as healthy.These features are segmented into four categories: demographic, symptom and examination, ECG (electrocardiogram), laboratory test and echo (echocardiogram).Each patient's diagnosis falls into one of two potential categories: Normal or CAD.Diagnostic angiography of three arteries-the left anterior descending (LAD), left circumflex (LCX), and right coronary artery (RCA)-was conducted.A diagnosis of CAD was confirmed if at least one of these arteries exhibited stenosis, defined as narrowing of at least 50%.Patients not meeting this criterion were classified as Normal.This dataset is freely accessible through the UCI Machine Learning Repository for academic research purposes.

Data preprocessing
The dataset contained no missing values.To address potential outliers, the Interquartile Range (IQR) method was employed.The IQR is defined as the difference between the third quartile (Q3) and the first quartile (Q1).Data points falling below Q1-1.5IQR or above Q3+1.5IQRwere considered outliers and replaced with Q1-1.5IQR or Q3+1.5IQR, respectively.Prediction targets were CAD onset (Yes/No), LAD, LCX, and RCA were used to determine CAD, so we removed these attributes.Feature selection was performed to extract representative feature subsets from the original data, aiming to build models with optimal performance while reducing computational costs.Due to potential correlation between length, weight, and BMI, length was deleted.The constant feature "Exertional.CP" (Exertional Chest Pain) was also removed as it provided no discernible information.Univariate feature analysis (i.e., T-test and Chi-square test) and Boruta [30] feature importance were used for selecting relevant features.Boruta is a feature selection algorithm that performs feature selection by assessing the significance of original features against randomly generated shadow features.Feature importance is determined based on the Z-score of its importance measure compared to the importance of shadow features.More important attributes receive higher importance scores.Boruta leverages the random forest classifier's principle of adding randomness to the system and aggregating results from multiple random samples.This approach mitigates the misleading effects of random fluctuations and identifies truly relevant features.
While SHAP analysis offers significant insights into feature importance and impact, it does not directly quantify the relationships between variables.Recognizing this limitation, we incorporated generalized additive models (GAMs) to fit curves between variables and their corresponding SHAP values.This methodological enhancement allows for a more nuanced exploration of how features influence model predictions.The application of GAMs in our analysis serves two primary purposes.Firstly, it enables the identification of critical points where the SHAP value (y) equals zero for a given feature value (X).This critical point is significant because it marks a threshold beyond which the feature's contribution to the model's outcome may change direction-from positive to negative or vice versa.Secondly, GAMs help identify inflection points on the curve, signaling a change in the relationship between a feature and the model's outcome.Specifically, these inflection points indicate where the odds ratio (OR), relative risk (RR), or hazard ratio (HR) values before and after the point differ.Incorporating GAMs for curve fitting thus allows us to uncover and interpret key points in the relationship between features and their SHAP values, offering a more detailed and nuanced understanding of feature contributions.This approach enhances the interpretability of machine learning models, particularly in the context of our study, by providing a clearer picture of how individual features influence model predictions.

Model development and evaluation
The dataset was randomly partitioned into training and testing sets at an 8:2 ratio for multimodel analysis on the selected features.To mitigate the influence of randomness on outcomes, 10-fold cross-validation was employed.The average performance on the test set was calculated.Recognizing the specificity of ML algorithms to particular problem domains, we explored nine ML models to identify the optimal predictive framework: Logistic Regression (LR), eXtreme Gradient Boosting (XGBoost), Adaptive Boosting (AdaBoost), Random Forest (RF), C5.0, Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Artificial Neural Network (ANN), and Naïve Bayes (NB).To address class imbalance, the Youden Index (sensitivity + specificity-1) was used to determine the optimal prediction probability cutoff.Model discrimination was evaluated using the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Area Under the Precision-Recall Curve (AUPRC), facilitating a nuanced assessment of model performance in imbalance datasets.Additionally, decision curve analysis (DCA) and model calibration were conducted.SHAP analysis quantified each risk factor's contribution to predictions, while the Partial Dependence Plot (PDP) illustrated the relationship between the target variable and features of interest.A generalized additive model (GAM) elucidated nonlinear risk factor correlations with the target, enhancing understanding of threshold and saturation effects.Piecewise linear regression was used to identify inflection points in nonlinear trends, segmenting continuous risk factors into subgroups based on these points.

Statistical analysis
All descriptive statistical analyses were performed using R version 4.2.0.Patients were divided into two groups based on whether they had CAD.Continuous variables were displayed as means and standard deviations (SD) using the T-test, categorical variables were expressed as frequencies and percentages using the chi-square test, p < 0.05 was considered statistically significant.Boruta was used to capture and rank the importance of features with respect to the outcome variable.Univariate and multivariate logistic regression were used to build crude and adjusted models, respectively, along with further trend analysis.

Characteristics of baseline data and the selection of critical factors
Table 1 presents the distribution of characteristics for the study cohort, which was comprised of 216 CAD and 87 Normal.The average age of the CAD population was older than that of the normal population (53.24±8.93),there was 130 (60.2%) males, which was higher than that of females.The Boruta algorithm was used to rank features importance, as shown in Fig 1 .The presence of multiple features with negative importance values indicates that these features were deemed less significant than the average shadow feature and were consequently excluded from the final model by the Boruta algorithm.14 variables (green boxplots) were confirmed to be important according to the Boruta feature selection method.Therefore, a total of 14 variables were selected for modeling, considering both the p-values from univariate analysis (p < 0.05 in Table 1) and the results of the Boruta.The final feature set comprised 6 numeric

Performance comparison of different ML methods
After identifying 14 relevant factors, we divided the data into training set and test set (the ratio is 8:2).We implemented nine machine learning models (LR, RF, AdaBoost, SVM, C5.0, XGBoost, NB, KNN, and ANN) to predict CAD.Model performance was evaluated using AUPRC and AUROC, and the Decision curves and Calibration curves were also analyzed.The decision curve showed that the net benefit of all models decreases as the threshold probability increased, the LR model consistently performed best, which also proved that the LR model had the best clinical suitability (Fig 2A).Meanwhile, the calibration curve (The closer the Apparent line is to the dashed line, the better the agreement between the predicted and actual values is) showed that LR exhibited the best fit between the actual diagnosis and the predicted diagnosis (Fig 2B).And, LR also showed the highest classification capability and predictive reliability, achieving the highest AUROC (area = 0.981) and AUPRC (area = 0.993) on the test set (Fig 2C and 2D).Others metrics were also optimal for LR (S1 Table ), Accuracy, Recall and Precision were 0.951, 0.955, and 0.977 respectively.In order to prevent overfitting, a 10-fold CV was also used for training model, Table 2 summarized the mean of model performance.LR maintained good predictive values, with an AUROC of 0.932, and showed the second highest AUPRC (0.964).RF achieved the second highest AUROC after LR, but had the highest AUPRC (0.967).Detailed evaluation metrics, including Accuracy, Recall, Precision, F1-Score and Specificity, were presented in this table.

Construction of nomgrams and risk factor assessments
Focused on the competitive LR model, a nomograph was constructed based on the coefficients of the variables (Fig 3).The risk probability of CAD can be calculated based on the sum of the patient's scores for each factor in the nomograph.Higher scores were associated with a higher risk of CAD.The dynamic nomogram is accessible online at https://mingyangkeyan.shinyapps.io/DynNomapp/.Two models (Univariate LR and Multivariate LR) were constructed to examine the relationship between risk factors and CAD.Crude odds ratio (cOR) and adjusted odds ratio (AdjOR) were shown in Table 3. Age, TG, DM, Typical Chest Pain, T inversion, and Region RWMA were considered statistically significant (p < 0.05) in both Univariate LR and Multivariate LR model.OR values of greater than 1 for these factors suggested that the risk of CAD increases with the exposure of these factors.Among these factors, Typical Chest Pain was identified as the most dangerous factor.We noted that the cOR of VHD was less than 1, this means VHD was a protective factor, while AdjOR was greater than 1, this means that VHD may be associated with an increased risk of CAD after accounting for other potential confounders.Regrettably, the disadvantage of LR model is that the interpretation is difficult because the interpretation of the weights β (OR = e β ) is multiplicative and not additive.Given that RF and LR have similar advantages in predictive power, and especially the discrimination (AUPRC, area = 0.967) of RF in unbalanced data by performing 10-fold.In order to provide visual explanations for the selected variables, SHAP was used to analyze RF model.and low risk respectively.For example, patients with Typical Chest Pain generally have a higher risk of CAD, and the risk of CAD increases with age.Moreover, this figure also shows the feature importance ranking, the importance diminishes from top to bottom, with Typical Chest Pain exhibits the most potent predictive power and followed by Age.Furthermore, we also analyzed the marginal effect on CAD risk of two continuous variables in detail using SHAP values, the PDP of Age and TG were displayed in Fig 5 .The "S"-shaped and "C"-shaped curve were discovered on the exposure-response relationship between critical factors and CAD risk.The smooth curve fitting revealed a threshold effect and saturation effect on PDP.Initially, CAD prevalence showed an overall increasing trend with Age or TG, and then tended to flatten out.When age exceeded 57 years or TG exceeded 130mg/dl, these factors exhibited a positive effect on CAD occurrence.Further, we used piecewise linear regression to find the inflection point of the non-linear connections, and divided into subgroups according to these observe point.We found that 62 years was the inflection point for Age, and 83 and 186 mg/dl were the inflection point for TG.The effects of Age and TG at different  segments on CAD risk were shown in Table 4.The risk of CAD significantly increased by 284% when age exceeded 62 years, with an odds ratio (OR) of 3.84 (95% CI: 2.12-7.35)compared to those aged less than 62 years.When TG ranged from 83 to 186 mg/dl, the risk of CAD increased by 60.9% compared to TG<83mg/dl, but the association was not statistically significant.When TG exceeded 186 mg/dl, the risk of CAD significantly increased from 1.609 to 3.236.Other risk factors Shapley curves are showed in S1 and S2 Figs.

Discussion
This study identified 14 critical features for assessing CAD risk using ML models.Compared to other models in this study, LR exhibited superior discriminative capabilities.This work not only constructed a prediction models, but also investigated the nonlinear relationship between risk factors and CAD by SHAP values, including an analysis of cutoff points, an aspect that has received limited attention in previous studies.This work can not only improve the interpretability of the model but also enrich medical knowledge by quantifying the underlying mechanisms of disease development.Furthermore, this study delves into the identification of thresholds and inflection points for risk factors, empowering clinical decision-makers to devise more targeted strategies and interventions with greater precision.Diversity is a key property of medical datasets.Although it provides rich and comprehensive information, it also significantly increases the complexity of medical data mining.To identify risk factors for CAD, extracting more informative medical knowledge is crucial for improving the accuracy and stability of predictive models.Through feature selection, we identified several crucial factors for CAD diagnosis, these features were included: 4 Demographic (Age, BP: Blood Pressure, DM: Diabetes Mellitus, HTN: Hypertension); 6 Laboratory and Echo (FBS: Fasting Blood Sugar, TG: Triglyceride, ESR: Erythrocyte Sedimentation Rate, EF TTF: ejection fraction, Region RWMA: regional wall motion abnormality, VHD: valvular heart disease); 3 Symptom and Examination (Typical Chest Pain, Atypical, Nonanginal); 1 ECG (T inversion).The current literature consistently demonstrates a relationship between these covariates and coronary artery disease (CAD).Advancements in information technology have made analyzing patients' biochemical characteristics for CAD risk assessment increasingly feasible and cost-effective.Among the models developed using these features, Logistic Regression (LR) and Random Forest (RF) have consistently shown robust performance.Due to its simplicity and ease of application in clinical settings, we chose to develop a nomogram based on LR.This nomogram, incorporating 14 variables, aims to assist clinicians in making informed decisions regarding the management and treatment of CAD patients.Furthermore, RF achieved superior outcomes through 10-fold cross-validation, suggesting that the selected set of features effectively discriminates against CAD.This finding enhances the reliability and practical utility of our results in a clinical context.
Exploring risk factor trends is crucial for reducing the prevalence of CAD.This study focused on two continuous variables: age and triglyceride (TG) levels, aiming to elucidate how CAD risk fluctuates with changes in these parameters.We employed Shapley values to visually delineate the underlying mechanisms of CAD etiology.These insights are crucial for clinicians to evaluate the potential for adverse outcomes and to strategize preemptive interventions.The relationship between age and CAD outcomes diverges depending on the cutoff value.When age exceeded 57, a positive effect was observed, indicating a higher risk for the elderly as age increases [31,32].This may be associated with the deterioration of physical functions and complicated diseases.TG was considered an independent risk factor for CAD [33], high TG levels significantly increase CAD risk, with a positive effect observed when TG levels exceed 130 mg/dL.This suggests that maintaining TG levels below 130mg/dL may effectively reduce CAD risk [33,34].In this study, the breakpoints of the non-linear connection between TG and CAD were estimated to range from 83 to 186 mg/dL, which were inconsistent with existing literature.Indeed, TG is regulated by multi-facet factors [35,36], these include inactivity, smoking, drinking too much alcohol, and being overweight, etc. [37].CAD detection relies on a multitude of variables.For instance, high BP significantly elevates CAD risk [38,39], and HTN is a major contributor to the development [39].DM heightens susceptibility to CAD [40], and even in the absence of DM, FBS levels pose a risk.Chronic hyperglycemia can lead to arterial hardening, thereby increasing CAD risk [41].The ESR provides diagnostic information about CAD [42], ESR was used to assess inflammatory status, which may be associated with CAD events.EF is utilized to assess cardiac function and monitor cardiovascular disease, where a reduction in EF signifies increased predictive value for CAD [43].Additionally, Region RWMA [44], valvular heart disease [45], Typical Chest Pain [46], Atypical, Nonanginal [47], T inversion [48] have been recognized as predictors of CAD in prior studies.This study possesses several strengths.Firstly, compared to other studies utilizing the same dataset [17][18][19][20][21][22], we achieved high performance while selecting relatively fewer features.Secondly, we introduced a dynamic nomogram for rapid and accurate assessment of CAD risk, thereby enhancing clinical decision-making capabilities.Thirdly, while prevailing machine learning investigations predominantly focus on model performance and risk factor identification, they frequently overlook the detailed influence of these factors on CAD.To address this gap, we employed Shapley values to elucidate the non-linear relationships between risk factors and CAD, shedding light on the threshold effects and contributing to a more nuanced understanding of CAD pathogenesis.Lastly, we innovatively applied a piecewise linear regression model to estimate breakpoints in these non-linear relationships, offering novel clinical insights that may inform the development of targeted intervention strategies.
This study has several limitations.Firstly, while these factors may be good predictors, clinicians are often more concerned about causation.Future research should focus on unraveling the causal relationships between these variables and CAD, which would contribute to precision medicine.Secondly, although our model offers valuable insights for clinical diagnosis, treatment, and the identification of potential risk factors for CAD, its applicability requires further validation with external datasets to confirm its generalizability and accuracy.Thirdly, the Z-Alizadeh Sani dataset is a public and single-center database.Its limited sample size may result in inadequate training of predictive models, and data class imbalances may lead to model bias.To enhance the model's robustness and reliability, future efforts should focus on incorporating a broader and more diverse patient population and addressing the issue of class imbalances within the dataset.Finally, future research should integrate imaging data with other metadata for faster and more accurate disease diagnosis.

Conclusion
Machine learning models are an effectively tool for predict the risk of CAD.Utilizing Shapley values and generalized additive models may reveals complex nonlinear interactions, aiding in early diagnosis and personalized treatment.

Fig 4
shows the 14 most important features.Each row represents a feature, with SHAP values for each sample plotted as different colored dots, where the red and blue dots represent high risk

Fig 1 .
Fig 1. Feature selection using Boruta algorithm: Variable importance plot.Red, yellow, green, and blue boxplots represent Z scores of rejected, tentative, confirmed and shadow attributes respectively.Shadow (minimum, mean, and maximum) features are reference points for deciding which attributes are truly important and these values are generated by the algorithm via shuffling values of the original attributes.Variables extracted from the 14 confirmed important, 36 confirmed unimportant and 3 tentative features.Weak Peripheral Pulse: WPP; Diastolic Murmur: DiaM; Typical Chest Pain: TCP; Poor R Progression: PRP.https://doi.org/10.1371/journal.pone.0307952.g001